
    5,13-Disulfamoyl-1,9-diazatetracyclo[7.7.1.0²,⁷.0¹⁰,¹⁵]heptadeca-2(7),3,5,10,12,14-hexaen-1-ium chloride

    In the title salt, C15H17N4O4S2+·Cl−, the chloride anion is disordered over two positions with occupancies of 0.776 (6) and 0.224 (6). The cation adopts an L shape and the dihedral angle between the benzene rings is 82.5 (3)°. In the crystal, inversion dimers of cations linked by pairs of N—H⋯N hydrogen bonds occur, with the bond arising from the protonated N atom. The cationic dimers are linked into chains via the disordered chloride ions by way of N—H⋯Cl hydrogen bonds; N—H⋯O, C—H⋯O and C—H⋯Cl interactions also occur, which help to consolidate the three-dimensional network.

    Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots

    Improving the generalization capabilities of general-purpose robotic agents has long been a significant challenge actively pursued by research communities. Existing approaches often rely on collecting large-scale real-world robotic data, such as the RT-1 dataset. However, these approaches typically suffer from low efficiency, limiting their capability in open-domain scenarios with new objects and diverse backgrounds. In this paper, we propose a novel paradigm that effectively leverages language-grounded segmentation masks generated by state-of-the-art foundation models to address a wide range of pick-and-place robot manipulation tasks in everyday scenarios. By integrating the precise semantics and geometry conveyed by the masks into our multi-view policy model, our approach can perceive accurate object poses and enables sample-efficient learning. Moreover, this design facilitates effective generalization to grasping new objects whose shapes are similar to those observed during training. Our approach consists of two distinct steps. First, we introduce a series of foundation models to accurately ground natural language demands across multiple tasks. Second, we develop a Multi-modal Multi-view Policy Model that incorporates inputs such as RGB images, semantic masks, and robot proprioception states to jointly predict precise and executable robot actions. Extensive real-world experiments conducted on a Franka Emika robot arm validate the effectiveness of our proposed paradigm. Real-world demos are available on YouTube (https://www.youtube.com/watch?v=1m9wNzfp_4E) and Bilibili (https://www.bilibili.com/video/BV178411Z7H2/).
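    The policy model is only described at a high level in the abstract; the following is a minimal sketch, under assumed names, feature dimensions, and a 7-dimensional action parameterization, of how multi-view RGB images, segmentation masks, and proprioception states could be fused to predict an action. It is an illustration of the fusion idea, not the authors' implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): fuse per-view
# RGB + mask features with proprioception and regress a single action vector.
import torch
import torch.nn as nn

class MultiViewMaskPolicy(nn.Module):
    def __init__(self, num_views: int = 2, proprio_dim: int = 8, action_dim: int = 7):
        super().__init__()
        # Shared per-view encoder over a 4-channel input: RGB (3) + binary mask (1).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),           # -> (B*V, 64)
        )
        self.head = nn.Sequential(
            nn.Linear(num_views * 64 + proprio_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),                      # e.g. xyz + rotation + gripper
        )

    def forward(self, rgb, mask, proprio):
        # rgb: (B, V, 3, H, W); mask: (B, V, 1, H, W); proprio: (B, proprio_dim)
        b, v = rgb.shape[:2]
        x = torch.cat([rgb, mask], dim=2).flatten(0, 1)      # (B*V, 4, H, W)
        feats = self.encoder(x).view(b, -1)                  # (B, V*64)
        return self.head(torch.cat([feats, proprio], dim=-1))

# Example: two camera views, 128x128 images, batch of 4.
policy = MultiViewMaskPolicy()
action = policy(torch.rand(4, 2, 3, 128, 128), torch.rand(4, 2, 1, 128, 128), torch.rand(4, 8))
print(action.shape)  # torch.Size([4, 7])
```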

    AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

    We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face using building blocks. These tasks often involve complex multi-step reasoning, presenting significant challenges due to the limited paired data connecting human instructions (e.g., making a smiley face) and robot actions (e.g., end-effector movement). Existing approaches alleviate this challenge by adopting an open-loop paradigm that decomposes high-level instructions into simple sub-task plans and executes them step by step using low-level control models. However, these approaches lack immediate observations during multi-step reasoning, leading to sub-optimal results. To address this issue, we propose to automatically collect a cognitive robot dataset with large language models (LLMs). The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observation sequences. To enable efficient data acquisition, we employ carefully designed multi-round prompts that effectively reduce the burden of extensive human involvement. We further propose a closed-loop multi-modal embodied planning model that autoregressively generates plans by taking image observations as input. To facilitate effective learning, we leverage MiniGPT-4 with a frozen visual encoder and LLM, and fine-tune an additional vision adapter and Q-Former to enable fine-grained spatial perception for manipulation tasks. Experiments verify the superiority of our approach over existing open- and closed-loop methods, with success-rate improvements of 21.4% and 14.5% over ChatGPT- and GPT-4-based robot baselines, respectively. Real-world demos are shown at https://www.youtube.com/watch?v=ayAzID1_qQk.
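    As an illustration of the embodied finetuning setup the abstract describes (frozen visual encoder and LLM, trainable vision adapter), here is a minimal sketch with placeholder modules; it does not use MiniGPT-4's actual API, and all class names and dimensions are assumptions.

```python
# Minimal sketch of adapter-only finetuning: the pretrained vision encoder and
# LLM are frozen, and only a small projection adapter receives gradient updates.
import torch
import torch.nn as nn

class AdapterPlanner(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module, vis_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder, self.llm = vision_encoder, llm
        # Trainable adapter projecting visual features into the LLM embedding space.
        self.adapter = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        for p in self.vision_encoder.parameters():   # frozen pretrained encoder
            p.requires_grad = False
        for p in self.llm.parameters():              # frozen pretrained LLM
            p.requires_grad = False

    def forward(self, images):
        with torch.no_grad():                        # no gradients through the frozen encoder
            vis_feats = self.vision_encoder(images)
        return self.llm(self.adapter(vis_feats))     # plan prediction conditioned on vision

# Stand-in encoder/LLM just to show which parameters an optimizer would update.
planner = AdapterPlanner(nn.Linear(256, 512), nn.Linear(768, 768), vis_dim=512, llm_dim=768)
trainable = [name for name, p in planner.named_parameters() if p.requires_grad]
print(trainable)  # only the 'adapter.*' weights remain trainable
```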

    3,5-Dimethyl-1H-pyrazole–2-hydroxy-5-(phenyldiazenyl)benzoic acid (1/1)

    There are two independent 3,5-dimethylpyrazole and two independent 2-hydroxy-5-(phenyldiazenyl)benzoic acid molecules [in which intramolecular O—H⋯O bonds form S(6) graph-set motifs] in the asymmetric unit of the title compound, C5H8N2·C13H10N2O3. In the crystal, the components are linked by intermolecular O—H⋯O, O—H⋯N and N—H⋯O hydrogen bonds, forming four-component clusters. Further stabilization is provided by weak C—H⋯π interactions.

    MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

    We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift-based attention block bridging the two subnets, which enables efficient cross-modal alignment and thus mutually reinforces audio-video fidelity. Extensive experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion. Comment: Accepted by CVPR 2023.
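    The released code is linked above; purely as an illustration of the random-shift idea, the sketch below lets one modality's tokens attend to a short, randomly shifted window of the other modality's tokens, which keeps cross-modal attention cheap. It is a rough approximation with assumed names and shapes, not the MM-Diffusion implementation.

```python
# Rough sketch of random-shift cross-modal attention: attend to a randomly
# shifted window of the other modality instead of its full token sequence.
import torch
import torch.nn as nn

class RandomShiftCrossAttention(nn.Module):
    def __init__(self, dim: int = 64, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_tokens, context_tokens):
        # query_tokens: (B, Lq, D) from one modality (e.g. video frames)
        # context_tokens: (B, Lc, D) from the other modality (e.g. audio segments)
        lc = context_tokens.shape[1]
        shift = torch.randint(0, max(lc - self.window, 1), (1,)).item()
        window = context_tokens[:, shift:shift + self.window]   # randomly shifted slice
        out, _ = self.attn(query_tokens, window, window)
        return out

# Example: 16 video tokens attend to an 8-token window of 64 audio tokens.
bridge = RandomShiftCrossAttention()
video, audio = torch.rand(2, 16, 64), torch.rand(2, 64, 64)
print(bridge(video, audio).shape)  # torch.Size([2, 16, 64])
```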